[SPARK-15678][SQL] Not use cache on appends and overwrites #13419
Conversation
```scala
spark.range(1000).write.mode("overwrite").parquet(path)
val df = sqlContext.read.parquet(path).cache()
assert(df.count() == 1000)
sqlContext.range(10).write.mode("overwrite").parquet(path)
```
`sqlContext` -> `spark`
Test build #59668 has finished for PR 13419 at commit
Hi, @sameeragarwal.
@dongjoon-hyun no reason; old habits. I'll fix this. Thanks! :)
I would prefer refreshing the dataset every time it is reloaded, but keeping existing ones unchanged:

```scala
val df1 = sqlContext.read.parquet(dir).cache()
df1.count() // outputs 1000
sqlContext.range(10).write.mode("overwrite").parquet(dir)
val df2 = sqlContext.read.parquet(dir)
df2.count() // outputs 10
df1.count() // still outputs 1000 because it was cached
```

Neither approach is perfectly safe, so I don't have a strong preference for either.
@mengxr it seems like overwriting generates new files, so we can achieve the same semantics without introducing an additional timestamp. The current solution should respect the contract for old dataframes while making sure that the new ones don't use the cached value. Let me know what you think.
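A rough sketch of that observation (not from the PR itself; it assumes a spark-shell session where `spark` is available and `dir` is a placeholder for a local output path):

```scala
import java.io.File

// Each overwrite of a parquet directory writes a fresh set of part files, so a
// catalog keyed on the individual leaf files (name + modification time) changes
// even though the directory path stays the same.
spark.range(1000).write.mode("overwrite").parquet(dir)
val before = new File(dir).listFiles().map(f => (f.getName, f.lastModified)).toSet

spark.range(10).write.mode("overwrite").parquet(dir)
val after = new File(dir).listFiles().map(f => (f.getName, f.lastModified)).toSet

// The part-file listing differs across the two writes, so comparing leaf files
// (rather than just the directory) distinguishes the old and new data.
assert(before != after)
```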
Also cc'ing @davies
Test build #59706 has finished for PR 13419 at commit
I guess that the caching is done over multiple nodes. If the data for a dataset is updated physically and some of the nodes where the data was cached go down, would the existing data still be considered cached?
@tejasapatil if the nodes where the data was cached go down, the CacheManager should still consider that data as cached. In that case, the next time the data is accessed, the underlying RDD will be recomputed and cached again.
I ended up creating a small design doc describing the problem and presenting 2 possible solutions at https://docs.google.com/document/d/1h5SzfC5UsvIrRpeLNDKSMKrKJvohkkccFlXo-GBAwQQ/edit?ts=574f717f#. Based on this, we decided in favor of option 2 (#13566) as it is a less intrusive change to the default behavior. I'm going to close this PR for now, but we may revisit this approach (i.e., option 1) for 2.1.
What changes were proposed in this pull request?
Spark currently, and incorrectly, continues to use cached data even if the underlying data is overwritten.
Current behavior:
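A minimal sketch, based on the test case quoted in the review above (`path` is a placeholder for a writable location):

```scala
spark.range(1000).write.mode("overwrite").parquet(path)
val df = spark.read.parquet(path).cache()
df.count()                       // 1000
spark.range(10).write.mode("overwrite").parquet(path)
spark.read.parquet(path).count() // incorrectly returns 1000, served from the stale cache
```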
Expected behavior:
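Per the discussion above, a fresh read after the overwrite should reflect the new data, while a DataFrame cached before the overwrite keeps returning its cached result:

```scala
spark.read.parquet(path).count() // should return 10, reflecting the overwritten data
df.count()                       // still 1000, since df was cached before the overwrite
```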
This patch fixes this bug by modifying the `ListingFileCatalog` logic that used to compare only the directory name (as opposed to individual files) when comparing two plans. Note that, in theory, this could lead to a slight regression (for a large number of files), but I didn't notice any regression in micro-benchmarks with 1000s of files.
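An illustrative sketch of the idea (simplified; these names are not the actual Spark `ListingFileCatalog` API): treat two relations as the same cached plan only if their leaf files match, not merely their root directory.

```scala
// Simplified stand-ins for the file catalog's metadata.
case class FileEntry(path: String, length: Long, modificationTime: Long)

case class SimpleFileCatalog(rootDir: String, leafFiles: Seq[FileEntry]) {

  // Old (buggy) notion of equality: same directory means same data, so a plan
  // cached before an overwrite still matches a plan built after it.
  def sameDirectory(other: SimpleFileCatalog): Boolean =
    rootDir == other.rootDir

  // New notion of equality: the individual files (and their metadata) must also
  // match. An overwrite produces a different set of part files, so the cached
  // plan no longer matches and the new read goes back to the source.
  def sameFiles(other: SimpleFileCatalog): Boolean =
    rootDir == other.rootDir && leafFiles.toSet == other.leafFiles.toSet
}
```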
How was this patch tested?
Unit tests for overwrites and appends in `ParquetQuerySuite`.